Optimizing Data Shuffling in Data-Parallel Computation by Understanding User-Defined Functions
Authors
Abstract
Map/Reduce-style data-parallel computation is characterized by the extensive use of user-defined functions for data processing and relies on data-shuffling stages to prepare data partitions for parallel computation. Instead of treating user-defined functions as “black boxes”, we propose to analyze those functions to turn them into “gray boxes” that expose opportunities to optimize data shuffling. We identify useful functional properties for user-defined functions, and propose SUDO, an optimization framework that reasons about data-partition properties, functional properties, and data shuffling. We have assessed this optimization opportunity on over 10,000 data-parallel programs used in production SCOPE clusters, and designed a framework that is incorporated into the production system. Experiments with real SCOPE programs on real production data have shown that this optimization can save up to 47% of the disk and network I/O for shuffling, and up to 48% of cross-pod network traffic.
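The "gray box" idea can be illustrated with a toy sketch. This is not SUDO or SCOPE code: the modulo partitioner, the two user-defined functions, and the "pass-through" property used here are assumptions chosen for illustration. The sketch shows that when a function is known to leave the partitioning key untouched, data that was already partitioned on that key stays partitioned, so a second shuffle before the next stage is unnecessary.

```python
def shuffle_by_key(records, num_partitions):
    """Stand-in for a shuffle: route (key, value) records by key modulo N."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[key % num_partitions].append((key, value))
    return partitions


def is_partitioned_on_key(partitions):
    """True if every key value lives in exactly one partition."""
    owner = {}
    for pid, part in enumerate(partitions):
        for key, _ in part:
            if owner.setdefault(key, pid) != pid:
                return False
    return True


def pass_through_udf(record):
    # Hypothetical UDF: rewrites the value but passes the key through unchanged.
    key, value = record
    return (key, value * 2)


def key_changing_udf(record):
    # Hypothetical UDF: rewrites the key, so nothing can be assumed about it.
    key, value = record
    return (key % 2, value)


def apply_udf(partitions, udf):
    """Run the UDF independently inside each partition (no data movement)."""
    return [[udf(rec) for rec in part] for part in partitions]


records = [(k, k * 10) for k in range(6)]
parts = shuffle_by_key(records, 3)

kept = apply_udf(parts, pass_through_udf)
broken = apply_udf(parts, key_changing_udf)

# The pass-through UDF preserves the partition property: no re-shuffle needed.
print(is_partitioned_on_key(kept))
# The key-changing UDF destroys it: a re-shuffle is required before grouping.
print(is_partitioned_on_key(broken))
```

An optimizer that treated both functions as black boxes would have to re-shuffle after either one; knowing the pass-through property lets it skip the shuffle in the first case only.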
Similar resources
Continuous Cloud-Scale Query Optimization and Processing
Massive data analysis in cloud-scale data centers plays a crucial role in making critical business decisions. High-level scripting languages free developers from understanding various system trade-offs, but introduce new challenges for query optimization. One key optimization challenge is missing accurate data statistics, typically due to massive data volumes and their distributed nature, comple...
Parallel computation framework for optimizing trailer routes in bulk transportation
We consider a rich tanker trailer routing problem with stochastic transit times for chemicals and liquid bulk orders. A typical route of the tanker trailer comprises sourcing a cleaned and prepped trailer from a pre-wash location, pickup and delivery of chemical orders, cleaning the tanker trailer at a post-wash location after order delivery, and prepping for the next order. Unlike traditiona...
Optimization of Agricultural BMPs Using a Parallel Computing Based Multi-Objective Optimization Algorithm
Beneficial Management Practices (BMPs) are important measures for reducing agricultural non-point source (NPS) pollution. However, selection of BMPs for placement in a watershed requires optimizing available resources to maximize possible water quality benefits. Due to its iterative nature, the optimization typically takes a long time to achieve the BMP trade-off results which is not desirable ...
Optimizing Data Partitioning for Data-Parallel Computing
Performance of data-parallel computing (e.g., MapReduce, DryadLINQ) heavily depends on its data partitions. Solutions implemented by the current state-of-the-art systems are far from optimal. Techniques proposed by the database community to find optimal data partitions are not directly applicable when complex user-defined functions and data models are involved. We outline our solution, which dr...
Optimizing Membership Functions using Learning Automata for Fuzzy Association Rule Mining
Transactions in web data often consist of quantitative data, suggesting that fuzzy set theory can be used to represent such data. The time spent by users on each web page, one type of web data, is regarded as a trapezoidal membership function (TMF) and can be used to evaluate user browsing behavior. The quality of mining fuzzy association rules depends on membership functions and since t...